
Conversation

Collaborator

@timmoon10 timmoon10 commented Jan 24, 2026

Description

This PR adds a grouped linear op, which can be used in the grouped MLP block in Mixture-of-Experts models. It also adds an experimental fused operation for a grouped MLP block, using a CuTe DSL kernel that computes an MXFP8 grouped GEMM and SwiGLU.
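
For orientation, here is a rough sketch of how these ops could compose into a grouped MLP block. The te.ops module path, constructor signatures, and the Sequential composition are assumptions for illustration, not the final API:

import transformer_engine.pytorch as te

# Hypothetical composition of the new ops into an MoE grouped MLP block.
# Constructor signatures and the Sequential composition are assumptions.
num_groups, hidden_size, ffn_size = 8, 4096, 16384
mlp = te.ops.Sequential(
    te.ops.GroupedLinear(num_groups, hidden_size, 2 * ffn_size, bias=False),  # FC1
    te.ops.ScaledSwiGLU(glu_interleave_size=32),  # gated activation, post-scaled e.g. by routing probs
    te.ops.GroupedLinear(num_groups, ffn_size, hidden_size, bias=False),      # FC2
)
# With an MXFP8 recipe, the fuser may replace this pattern with the CuTe DSL
# grouped GEMM + SwiGLU kernel when the shape constraints are satisfied.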

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add a grouped linear operation
  • Add a post-scaled SwiGLU op and support for interleaving SwiGLU gate and linear units (reference sketch below)
  • Add a fused operation for grouped MLP
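
As a reference for the post-scaled, interleaved SwiGLU, an unfused sketch of the intended math follows. The exact interleaving layout and the gate-before-linear ordering are assumptions:

import torch
import torch.nn.functional as F

def scaled_swiglu_reference(x: torch.Tensor, scales: torch.Tensor,
                            glu_interleave_size: int = 32) -> torch.Tensor:
    """Unfused sketch: SwiGLU with gate/linear units interleaved in blocks of
    `glu_interleave_size` channels, scaled after the activation (e.g. by MoE
    routing probabilities). Layout and gate-first ordering are assumptions."""
    *batch, channels = x.shape
    x = x.reshape(*batch, channels // (2 * glu_interleave_size), 2, glu_interleave_size)
    gate, linear = x[..., 0, :], x[..., 1, :]
    out = (F.silu(gate) * linear).reshape(*batch, channels // 2)
    return out * scales  # post-scaling applied to the activation output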

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 30 commits January 7, 2026 00:15
Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Test is too permissive since the test should still be failing. The weights are not properly interleaved yet.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 and others added 4 commits February 5, 2026 02:18
@greptile-apps greptile-apps bot left a comment

9 files reviewed, 4 comments

Comment on lines +290 to +292
    quantizer=fc2_input_quantizers[group_idx],
    requires_grad=False,
    with_gemm_swizzled_scales=True,

Incorrect grad-required flags

In ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8.fuser_forward, swiglu_ctx.input_requires_grad and swiglu_ctx.extra_input_requires_grad are set to True unconditionally (and input_requires_grad is set to requires_grad unconditionally). This will make ScaledSwiGLU.fuser_backward compute grad_input and grad_extra_input even when neither input_ nor scales require grads, which violates autograd semantics and can raise errors (e.g., when scales.detach() is passed into the fused kernel but extra_input_requires_grad=True forces a gradient).

These flags should be set based on the actual requirements (see the toy sketch after this list):

  • input_requires_grad = input_.requires_grad
  • swiglu_ctx.extra_input_requires_grad = scales.requires_grad
  • and for FC weights, check each parameter’s requires_grad (not just weight0).
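
A toy, standalone illustration of that flag derivation (names are illustrative, not taken from the PR):

import torch

# Derive grad-required flags from the tensors instead of hard-coding True.
x = torch.randn(4, 8, requires_grad=True)
scales = torch.rand(4, 1).detach()  # e.g. routing probabilities with no grad

input_requires_grad = x.requires_grad              # True
extra_input_requires_grad = scales.requires_grad   # False -> skip grad_extra_input

weights = [torch.nn.Parameter(torch.randn(8, 8)) for _ in range(3)]
weights[1].requires_grad_(False)  # e.g. a frozen expert
weight_requires_grad = [w.requires_grad for w in weights]  # per parameter, not just weights[0]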

Comment on lines +420 to +460
# Return immediately if fused kernel is not supported
if not BackwardGroupedMLP_CuTeGEMMDSwiGLU_MXFP8.is_supported():
    return ops

# Check if recipe is supported
if recipe is None:
    return ops
if not recipe.mxfp8():
    return ops

# Scan through ops, fusing if possible
out = []
window, ops = ops[:3], ops[3:]
while len(window) == 3:

    # Check if window matches pattern
    matches_pattern = True
    if not (
        isinstance(window[0], GroupedLinear)
        and isinstance(window[1], ScaledSwiGLU)
        and isinstance(window[2], GroupedLinear)
    ):
        matches_pattern = False
    elif window[0].has_bias or window[2].has_bias:
        matches_pattern = False
    elif window[0].num_groups != window[2].num_groups:
        matches_pattern = False
    elif (
        window[0].in_features % 256 != 0
        or window[0].out_features % 256 != 0
        or window[2].in_features % 256 != 0
        or window[2].out_features % 256 != 0
    ):
        matches_pattern = False
    elif window[1].glu_interleave_size != 32:
        matches_pattern = False

    if matches_pattern:
        # Construct fused op if window matches pattern
        op = BackwardGroupedMLP_CuTeGEMMDSwiGLU_MXFP8(
            fc1=window[0],

Broken fusion window scan

Both fuse_backward_ops and fuse_forward_ops have a window/shift loop that can drop or reorder ops when the pattern doesn’t match. In the non-matching branch you do out.extend(window[:-2]); window = window[-2:] and then immediately do out.extend(window[:-3]) (which is a no-op for a 2-element window) before refilling. This causes the scan to advance by 1 op in some cases and by 2 in others, and it never emits window[-1] until the very end. For sequences like [A,B,C,D] where [A,B,C] doesn’t match but [B,C,D] would (or vice versa), this loop will not correctly consider all 3-op windows and can produce an incorrect fused op list.

This needs a standard sliding-window approach (advance by 1 when not matching; replace 3->1 when matching) to ensure no ops are skipped or duplicated.
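
A generic sketch of such a sliding-window scan (placeholder callables, not the PR's fuse_backward_ops/fuse_forward_ops):

from typing import Callable, List

def fuse_windows(ops: List[object],
                 matches: Callable[[List[object]], bool],
                 make_fused: Callable[[List[object]], object]) -> List[object]:
    """3-op sliding-window fusion: advance by one op when the window does not
    match, replace the three ops with one fused op when it does."""
    out: List[object] = []
    i = 0
    while i < len(ops):
        window = ops[i:i + 3]
        if len(window) == 3 and matches(window):
            out.append(make_fused(window))
            i += 3  # the fused op is final; skip past it
        else:
            out.append(ops[i])
            i += 1  # emit one op and slide the window forward
    return out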

Collaborator Author

@timmoon10 timmoon10 Feb 12, 2026

I think your reading of the code is incorrect, although I would take suggestions for improving clarity. The loop condition guarantees the window size is 3 at the beginning of each iteration. We either fuse the window or eject the first op in the window, and then we refill back up to window size 3.

The only hint of an edge case I can see is if we perform a fusion, and that fused op can participate in further fusions. Then we might want to rewind the sliding window so that we reexamine the fused op in each window position. However, we know that the fused op is final, and we can safely advance the window past it.
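
For comparison, a sketch of the eject-and-refill structure described above, which visits the same three-op windows (placeholder callables, not the PR code):

from typing import Callable, List

def fuse_by_eject_and_refill(ops: List[object],
                             matches: Callable[[List[object]], bool],
                             make_fused: Callable[[List[object]], object]) -> List[object]:
    out: List[object] = []
    window, rest = ops[:3], ops[3:]
    while len(window) == 3:
        if matches(window):
            out.append(make_fused(window))  # fused op is final; do not revisit it
            window = []
        else:
            out.append(window.pop(0))  # eject only the first op in the window
        while len(window) < 3 and rest:  # refill the window back up to three ops
            window.append(rest.pop(0))
    out.extend(window)  # flush the remaining (fewer than three) ops
    return out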

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

Labels

performance (Performance issues)